comis

An R Package to Read CCCCO MIS Files

Christian Million
Data Analyst

Yosemite Community College District

Goals

Main Goal

  • Showcase the benefits of developing internal packages with R.

Along the way…

  • Inspire!

  • Use comis as a motivating example

  • Why is package development worthwhile?

  • Learn too much about MIS files.

library(comis)

# Read Referential File
CB <- read_ref("path/to/CB223.txt")

# Read Submission File
XB <- read_sub("path/to/U59217XB.DAT")

What is comis?

An internally developed R package

Purpose

Read and Format:

  • MIS Submission Files

  • MIS Referential Files

MIS 101

Submission Files: Districts submit MIS data to the state via these files

  • ~25 files | 396 elements


Referential Files: Districts retrieve these files from Data-On-Demand

  • ~27 files | 406 elements

The Challenge

  • MIS Data is important

  • We want to use and analyze it easily

  • The Challenge: Reading MIS data into R is difficult and error-prone

Submission Files - Challenges

  • Fixed Width Format

  • No Column Names

  • Numbers that should be characters / dates

  • Missing values (NA)

  • Trailing white space

  • Implied decimal points
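A minimal base-R sketch of what handling several of these challenges by hand can look like. The widths, column names, and records below are made up for illustration and are not a real MIS layout; `parse_fixed_width()` is a hypothetical helper, not a comis function.

```r
# Toy records: 3-char college ID, 8-char course ID, 4-char units
# (hypothetical layout, not a real MIS file)
lines <- c(
  "592ENGL101 0300",
  "592MATH20  0450",
  "592ART1        "
)

parse_fixed_width <- function(lines, widths, col_names) {
  # Convert widths into start / end character positions
  ends <- cumsum(widths)
  starts <- ends - widths + 1
  cols <- lapply(seq_along(widths), function(i) {
    # Slice the column, drop padding, keep everything as character
    value <- trimws(substring(lines, starts[i], ends[i]))
    # Blank fields become missing values
    value[value == ""] <- NA_character_
    value
  })
  names(cols) <- col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

rec <- parse_fixed_width(lines,
                         widths = c(3, 8, 4),
                         col_names = c("COLLEGE", "COURSE", "UNITS"))
print(rec)
```

Reading every column as character keeps leading zeros intact; converting types is a separate, deliberate step.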

Referential Files - Challenges

  • Tab Delimited :)

  • No Column Names

  • Numbers that should be characters / dates

  • Missing values (NA)

  • Trailing white space

  • Implied decimal points

  • Different date format than submission file.
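Two of these challenges, implied decimals and differing date formats, can be sketched in a few lines of base R. The number of implied decimal places and both date formats below are assumptions for illustration; the actual values depend on the element definition for each file.

```r
# Implied decimal: "0300" with two implied places means 3.00
# (digits = 2 is an assumed default, not taken from the MIS spec)
implied_decimal <- function(x, digits = 2) {
  as.numeric(x) / 10^digits
}

units <- implied_decimal(c("0300", "0450", NA))

# The same date stored in two different (assumed) formats
sub_date <- as.Date("20230115", format = "%Y%m%d")
ref_date <- as.Date("01/15/2023", format = "%m/%d/%Y")

print(units)
print(sub_date == ref_date)
```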

Yikes

Imagine writing code to handle this for each analysis:

  • A lot to re-remember

  • Cognitively taxing to implement

  • Takes time

  • Updates to multiple scripts

  • Copy / paste errors

  • Makes scripts more difficult to read

  • Unfulfilling

  • Lots of overhead before analysis can begin

Before comis

# Load Libraries -----
library(dplyr)
library(readr)


# Define Names, Types, and Widths -----
CB_col_names <- c('GI90', 'GI01', 'GI03', paste0("CB0", 0:9), paste0("CB", 10:27), "Filler")
CB_col_types <- strrep("c", length(CB_col_names)) # one "c" (character) per column
CB_col_width <- c(2, 3, 3, 12, 12, 68, 6, 1, 1, length(109:112), length(113:116),
                  1, 1, 1, 1, 1, 1, 6, 8, length(137:148), length(149:160),
                  length(161:172), 7, 9, 1, 1, 1, 1, 1, 1, 1, 26)

XB_col_names <- c('GI90', 'GI01', 'GI03', 'GI02', 'CB01', paste0('XB0', 0:9), 'XB10', 'XB11', 'XB12', 'CB00', 'Filler')
XB_col_types <- strrep("c", length(XB_col_names)) # one "c" (character) per column
XB_col_width <- c(2, 3, 3, 3, 12, 6, 1, 6, 6, 1, length(44:47), length(48:51),
                  1, 1, 1, 1, length(56:61), 1, 12, 7)

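# Read the source data -----
# Submission files are fixed width, so they are read with read_fwf()
CB_src <- readr::read_fwf("path/to/U59223CB.DAT",
                           col_positions = readr::fwf_widths(CB_col_width, CB_col_names),
                           col_types = CB_col_types,
                           trim_ws = TRUE)

XB_src <- readr::read_fwf("path/to/U59223XB.DAT",
                           col_positions = readr::fwf_widths(XB_col_width, CB_col_names), # copy / paste error: should be XB_col_names
                           col_types = XB_col_types,
                           trim_ws = TRUE)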

# Clean and Reformat Data -----
# date_cleaning_code() and implicit_decimal_code() stand in for per-column cleaning logic
CB <- CB_src |>
    mutate(dates = date_cleaning_code(),
           units = implicit_decimal_code())
           
XB <- XB_src |>
    mutate(dates = date_cleaning_code(),
           units = implicit_decimal_code())

After comis

# Load Libraries -----
library(comis)

# Read Data -----
CB <- read_sub("path/to/U59223CB.DAT")
XB <- read_sub("path/to/U59223XB.DAT")
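One way a single reader function could replace all of the per-file setup is to keep the layouts in one registry and pick the right one from the file name. The sketch below is hypothetical and is not how comis is implemented: the registry contents are toy values, `read_fixed()` and `domain_from_path()` are invented names, and it assumes the two letters before ".DAT" identify the file's domain.

```r
# Hypothetical layout registry (toy widths and names, not the real MIS layouts)
layouts <- list(
  CB = list(names = c("GI01", "CB00"), widths = c(3, 12)),
  XB = list(names = c("GI01", "GI02", "XB00"), widths = c(3, 3, 6))
)

# Assumption: the two letters before ".DAT" name the domain
domain_from_path <- function(path) {
  sub("^.*([A-Z]{2})\\.DAT$", "\\1", basename(path))
}

# `lines` defaults to the file contents; passing it directly allows a demo
# without a file on disk
read_fixed <- function(path, lines = readLines(path)) {
  layout <- layouts[[domain_from_path(path)]]
  if (is.null(layout)) stop("No layout registered for ", path)
  ends <- cumsum(layout$widths)
  starts <- ends - layout$widths + 1
  cols <- lapply(seq_along(ends), function(i) {
    trimws(substring(lines, starts[i], ends[i]))
  })
  names(cols) <- layout$names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

xb <- read_fixed("path/to/U59223XB.DAT",
                 lines = c("592223123456", "592223654321"))
print(xb)
```

With this shape, a layout change is made once in the registry rather than in every analysis script.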

Additional Features

  • Contains useful data found on CCCCO websites

  • Read many files at once

  • Read from repo

  • Use DED Name or Descriptive Name
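A rough sketch of how "read many files at once" and "read from repo" could work under the hood. This is an illustration only, not the comis source: `ref_repo_paths()` and `read_many()` are hypothetical helpers, and the assumption that repo files are named like "CB217.txt" is inferred from the file names used elsewhere in these slides.

```r
# Hypothetical helper: build paths like "<repo>/CB217.txt" from a domain and terms
ref_repo_paths <- function(domain, terms,
                           repo = getOption("comis.repo.referential")) {
  if (is.null(repo)) stop("Set options(comis.repo.referential = ...) first.")
  repo <- sub("/+$", "", repo) # drop trailing slashes before joining
  file.path(repo, paste0(domain, terms, ".txt"))
}

# Hypothetical helper: read each file with the supplied reader, then stack the rows
read_many <- function(paths, reader) {
  do.call(rbind, lapply(paths, reader))
}

paths <- ref_repo_paths("CB", c("217", "223"), repo = "path/to/ref/repo/")
print(paths)

# Stand-in reader so the example runs without files on disk
fake_reader <- function(path) {
  data.frame(file = basename(path), stringsAsFactors = FALSE)
}
stacked <- read_many(paths, fake_reader)
print(stacked)
```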

library(dplyr)
library(comis)

read_ref_repo("CB", c("217", "223")) |>
    left_join(top_codes, by = c("CB03" = "top_code")) |>
    left_join(colleges, by = c("GI01")) |>
    filter(vocational == "Y",
           institution == "COLUMBIA")

library(comis)

# Reads many files of same "domain" at once
read_sub(c("U59223CB.DAT", "U59217CB.DAT"))


read_ref(c("CB217.txt", "CB223.txt"))

library(comis)

# Set in .Rprofile or .Renviron
options(comis.repo.referential = "path/to/ref/repo/")

read_ref_repo("CB", c("217", "223"))

library(comis)

# Column names are DED Codes.
# like "GI01", "CB00", "CB01"
read_ref_repo("CB", "217")

# Column names are words.
# like "COLLEGE_ID", "COURSE_ID", "CONTROL_NUMBER"
read_ref_repo("CB", "217", desc = TRUE)
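Switching between DED codes and descriptive names comes down to a lookup table. The sketch below shows the idea in base R; `use_descriptive_names()` is a hypothetical helper, and the pairing of each code with a specific descriptive name is an assumption for illustration rather than the package's actual mapping.

```r
# Assumed mapping from DED codes to descriptive names (illustrative only)
ded_to_desc <- c(
  GI01 = "COLLEGE_ID",
  CB00 = "CONTROL_NUMBER",
  CB01 = "COURSE_ID"
)

use_descriptive_names <- function(df, lookup = ded_to_desc) {
  # Rename only the columns found in the lookup; leave the rest untouched
  known <- names(df) %in% names(lookup)
  names(df)[known] <- unname(lookup[names(df)[known]])
  df
}

# Toy data frame with made-up values
cb <- data.frame(GI01 = "592", CB00 = "CCC000123456", CB03 = "150100",
                 stringsAsFactors = FALSE)
cb_desc <- use_descriptive_names(cb)
print(names(cb_desc))
```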

Benefits of comis

  • Easier to tell what’s happening

  • Reduces cognitive overhead

  • Get to analysis faster and with more confidence

  • Documentation contained within the package

  • Updates made in one spot (instead of throughout various scripts)

  • Shifts focus to what’s important - Using the Data

Why Develop Internal R Packages?

  • Addresses problems specific to the institution

  • Reasonable defaults

  • Abstracts common tasks

  • Maintainable

  • Easily share code with others

  • Business logic is located in one place

Thanks!

Contact


Christian Million

Data Analyst

Yosemite Community College District